Virginia Tech – Professional Analysts using a
Large, High-Resolution Display
VAST 2009 Challenge
Challenge 1: Badge and Network Traffic
Authors and Affiliations:
Alex Endert, Virginia
Tech, aendert@cs.vt.edu [PRIMARY contact]
Christopher Andrews, Virginia Tech, cpa@cs.vt.edu
Chris North, Virginia Tech, north@cs.vt.edu
[Faculty advisor]
Tool(s): (Short Answer)
Our
team’s goal was to observe how professional analysts solve this challenge using
a visualization on a large, high-resolution display. As we worked in
conjunction with the team developing the dataset, we were aware of the
solution, so our goal was not to solve the task, but to observe the process
taken by professional analysts attempting to solve this challenge. We are not submitting to this contest in the traditional manner, but
rather taking a different approach to this challenge by highlighting the
processes professional cyber analysts take to solve a challenge such as this.
Figure 1. The large, high-resolution display,
arranged in a curved setup, and totaling nearly 33 megapixels.
Rather
than equipping the analysts with special purpose cyber-analytic tools, we
provided Microsoft Excel to display and manipulate the raw data and the general-purpose
visualization tool, Spotfire (http://www.spotfire.com). In addition, all
analysis was performed using a large, high-resolution display running Windows
XP. The display consists of eight 30-inch LCD panels, tiled in a 4x2
configuration totaling nearly 33 megapixels (Figure 1).
The analysts were able to display all of the information relevant to the
challenge without minimizing any windows due to space constraints. In addition,
they also had the ability to view and interact with the visualization when
enlarged to span all eight screens (Figure 2).
This meant that they could physically navigate to gain an overview of the
dataset, examine details, switch tasks, and rapidly consult multiple views and
tools. The ability to solve this challenge visually was largely due to the
large, high-resolution display, which allowed for multiple views of the data as
well as the persistence of the entire dataset (e.g. the map, prox data, network
data, and employee IP list). When the visualization was enlarged, it also
provided enough detail that little data was hidden by the aggregation
techniques required on smaller displays.
Figure 2. A screenshot of the visualization showing
the combined prox and IP data maximized to all displays. An enlargement shows
the visualization to scale. Color was used to encode the state of each person
(red: in the building, blue: in classified, yellow: proxed out of classified).
Video Link: vastvid.mov
ANSWERS:
MC1.1: Identify which
computer(s) the employee most likely used to send information to his contact in
a tab-delimited table which contains for each computer identified: when the
information was sent, how much information was sent and where that information
was sent.
MC1.2:
Characterize the patterns of behavior of suspicious computer use. (Detailed
Answer)
Our team worked in collaboration with the group creating the VAST 2009
dataset, and our involvement in the creation of the dataset revealed the
solution to us, allowing us to guide the analysts through the task whenever
needed. We chose to perform such a guided study both to accelerate the
investigation so it would fit within a two-hour session and to uncover cyber
analysts' investigation processes. Our participants were four professional
cyber analysts from a large government laboratory.
Each analyst was given a two-hour session to solve this
challenge. We captured their progress with a video recording and automated
screenshots taken every minute. We followed the study with an
interview where we asked a series of questions regarding their experience, as
well as their typical workspaces and tools.
Our guidance of the analysts can be categorized as three
main types. First, we would periodically encourage the analysts to use the
visualization. We observed three of the four analysts heavily favoring "the
raw data" in Excel, and would suggest that they use the visualization in
conjunction with Excel. We would remind the analysts that in order to solve
this challenge, one may need to make use of all of the data provided, as well
as establish relationships between its separate parts. Second, we served as a
quick reference for questions about the challenge. Due to the two-hour time
limit, the analysts would ask us questions like those normally posted on the
VAST Challenge Discussion Blog, mainly pertaining to the nature of the dataset
and the assumptions that can or cannot be made. Third, when we observed an
analyst spending a large amount of time pursuing a "dead end", we would ask
them to stop pursuing that aspect of the data in the interest of time. For
instance, some analysts would search online repositories of "bad IPs" to see
if any matched the IPs of this challenge; we informed them that the IPs in
this study are not "real", so this approach would not work.
Prior to the studies, we took the time to combine the IPLog data and the prox
data in Excel, which took nearly two hours (a more experienced Excel user
could combine the data significantly faster). Because this step was so time
consuming, when the analysts showed interest in combining this data during
their investigation, we gave them our pre-combined data (in both Spotfire and
Excel).
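This combination step can also be sketched programmatically. The following minimal Python sketch uses hypothetical, simplified record layouts (the actual challenge files differ); the idea is simply to annotate each network record with the most recent badge event for that employee's assigned IP.

```python
# Minimal sketch of joining prox (badge) data with the IP log.
# Field names and sample records are hypothetical simplifications
# of the challenge data, not its actual schema.
prox_events = [
    {"ip": "37.170.100.15", "time": "2008-01-31T13:05:00", "event": "prox-in-classified"},
    {"ip": "37.170.100.15", "time": "2008-01-31T13:40:00", "event": "prox-out-classified"},
]
ip_log = [
    {"source_ip": "37.170.100.15", "time": "2008-01-31T13:10:23",
     "dest_ip": "100.59.151.133", "socket": 8080},
]

# Index badge events by the employee's assigned IP.
by_ip = {}
for rec in prox_events:
    by_ip.setdefault(rec["ip"], []).append(rec)

def latest_prox_state(ip, when):
    """Last badge event for this IP at or before `when`, or None.
    ISO-8601 timestamps compare correctly as plain strings."""
    earlier = [r for r in by_ip.get(ip, []) if r["time"] <= when]
    return max(earlier, key=lambda r: r["time"]) if earlier else None

# Annotate every network record with the badge state in effect at that time.
combined = []
for rec in ip_log:
    state = latest_prox_state(rec["source_ip"], rec["time"])
    combined.append({**rec, "prox_state": state["event"] if state else "unknown"})

print(combined[0]["prox_state"])  # prints "prox-in-classified"
```

The same annotated records are what the combined Spotfire visualization plots: one mark per network event, colored by the badge state in effect at that moment.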
During the initial stages of each trial, it became clear
that each analyst had their own personal set of queries and approaches to
solving such a challenge. Each analyst started the study by performing a series
of premeditated searches and questions based on their prior domain knowledge.
These included queries on specific IPs, sorting by largest packets, creating
pivot tables in Excel to highlight unique IP-to-IP connections, and more.
Their background also strongly dictated the tools they used. For instance, one
analyst was very skilled in Excel, so the majority of her work was done
creating different views of the data within Excel. This proved problematic
because the majority of her work and interactions with the data could not be
captured in any easy way. At times, she resorted to saving versions of the
Excel file in order to maintain a "working state" of the data, from which she
would further explore other directions of her investigation.
The other analysts mainly worked back and forth between
the visualization and the data in Excel, with one analyst doing the majority of
his work within Spotfire. We believe this occurred due to his previous
experience with such a tool, as he felt very comfortable manipulating the
visualization. As keeping the data between the two tools synchronized is
difficult, the analysts would often use the visualization as a means for
exploration and discovery, and then use the Excel file as a way to “quantify
and reconfirm” what they saw. One analyst kept a separate “note file”, where
interesting information was pasted from time to time.
However, moving their investigation from a textual,
query-driven analysis within Excel to a visual one in Spotfire did not come
easily to most analysts. Often, when we would point out something to them within
the visualization, they would glance at it, and then move directly back to
Excel and continue their work there. There was a clear distrust of visualizations
by cyber analysts. Our post-study interview further revealed some of their
thoughts on visualizations. Some commented that visualizations “hide the data”
due to aggregation algorithms, others claimed they were unable to “save states
of what [they were] working on”, causing them to be very tentative with their
visual exploration due to fear of corrupting the current state of their
investigation.
Figure 3. A screenshot showing the use of
multiple views of the data through multiple instances of Spotfire and Excel
running simultaneously. The analyst has the employee data, prox data, IP log
data, task description, and a notes file fully visible. Notice that the
majority of the interactions occur within the tools (e.g. creation of pivot
tables within Excel) and do not utilize the additional screen space well.
Based on past research, we hypothesized the added display
space of the large, high-resolution display would be used for: (1) showing
multiple views of the data (e.g. many windows open, showing different aspects
of the same data) shown in Figure
3, or (2) showing more detail of a single view
(e.g. one maximized detailed visualization) shown in Figure
4. Both of these uses were observed at times
throughout the study. However, we believe their use of the space could have
been much better, for three main reasons. First, based on their responses in
the interviews, none of the analysts were familiar with using a display such
as this. We have found that in order to get comfortable with such a display,
one needs to use it for a week or more, not merely two hours. Second, there
was a learning curve associated with the visualization tool, causing three of
the four analysts to shy away from performing their investigation visually.
Again, we would guide them and show them the basics of the tool to get them
started, but they never became comfortable enough with it to spread their
workspace across multiple instances of Spotfire and achieve "multiple views of
the data". We saw only one analyst do so, and he relied heavily on our help in
setting up the instances of the tool. Third, the tools (both Spotfire and
Excel) did not allow for proper use of the added display space. The analysts'
interactions (i.e. their "work") were not captured and represented in the form
of either larger, more detailed views (2) or extra views (1).
Figure 4. A screenshot showing an instance
when an analyst maximized the visualization of the combined IP and prox data
across all eight screens. This allowed for a fully-detailed visualization,
without the aggregation which may hide data on smaller displays.
A critical point in
every analyst’s investigation occurred when they made the connection between
the prox and IP data. The “synchronization” of this data, a way to combine the
prox information with the network data from each employee, provided a way to
easily visualize where an employee is located when their assigned IP is
actively sending information. However, as we found out in the interviews,
cyber analysts are unaccustomed to aggregating heterogeneous data sources like
this. Their job often deals solely with the network data rather than a
collection of datasets that tie together. Some remarked that this
method "had never occurred to them". The resulting visualization of the
combined data can be seen in Figure
2. We believe
this realization was brought about by the persistence of the data and the
corresponding views. The analysts were often observed switching between the different
data during their investigation. The actual switching was simple, performed by
turning their head or rotating their chair instead of accessing the task bar or
re-arranging windows. Two of the four analysts were able to make this
connection within the first hour of working with the data. For the other two,
we guided them to consider this connection in the interest of time. For all of
the analysts, we provided them with the pre-combined Excel file and
corresponding visualization, as we did not want them spending time on this aspect
of the challenge.
When seeing the combined data visually, the analysts were immediately able to
recognize new aspects of the data. We received comments that the data became
"easier to work with", as the concept of an employee's regular schedule became
clear, including its context relative to the rest of the employees' schedules.
The structure of time became apparent (5 weeks, each with 5 work days and a
2-day weekend), and other insights followed, including the critical one: what
do the "blue dots" mean? The "blue dots" represent an instance when an
employee's IP is actively sending information over the network while that
employee is prox'ed into the classified area – an occurrence which should
never happen.
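The "blue dot" condition the analysts converged on can be expressed directly as a predicate. The sketch below assumes hypothetical interval and event layouts, not the challenge's actual file formats:

```python
# "Blue dot" condition: a network event from an employee's assigned IP
# while that employee is badged into the classified area.
# All records below are hypothetical simplifications.
classified_intervals = {
    # ip -> list of (enter, leave) timestamps while in the classified area
    "37.170.100.15": [("2008-01-31T13:00:00", "2008-01-31T13:30:00")],
    "37.170.100.20": [("2008-01-31T09:00:00", "2008-01-31T09:15:00")],
}

# (source_ip, time, dest_ip, port)
network_events = [
    ("37.170.100.15", "2008-01-31T13:10:23", "100.59.151.133", 8080),
    ("37.170.100.20", "2008-01-31T10:00:00", "10.1.1.4", 80),
]

def is_blue_dot(ip, when):
    """True if the owner of `ip` was inside the classified area at `when`.
    ISO-8601 timestamps compare correctly as plain strings."""
    return any(enter <= when <= leave
               for enter, leave in classified_intervals.get(ip, []))

# Only the first event matches: traffic sent from inside the classified area.
suspicious = [e for e in network_events if is_blue_dot(e[0], e[1])]
```

Visually filtering the scatterplot to these "blue dot" events is equivalent to keeping only the records where this predicate holds.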
We urged the analysts to continue working within the
visualization, providing assistance on how to manipulate the visualization to
show what they wanted to see. As the analysts continued to pursue the meaning
of the "blue dots", they arrived at their conclusion: "An employee's assigned
IP sending information while they are proxed into the classified area".
Glancing back at the map of the office with a simple turn
of their head, they noticed that there were no computers in the classified
section, and therefore no network traffic should be seen from those employees’
IPs. Upon further visual filtering, they obtained a view of only these
activities (Figure
5), showing them email traffic (port 25), web traffic (port 80), and
other network traffic on port 8080.
Figure 5. A filtered scatterplot showing only network traffic from an employee's assigned IP while they are proxed into the classified area. The data includes email traffic (port 25), web traffic (port 80), and other network traffic (port 8080). The y-axis is the source IP, and the x-axis is time.
The analysts then
became interested in where this information was going. At this point, each
seemed excited that they had been able to so dramatically narrow down the
amount of data being shown. They changed their y-axis to represent the
destination IP (Figure
6).
Figure 6. A filtered scatterplot showing only network traffic from an employee’s assigned IP while they are proxed into the classified area. The y-axis is the destination IP, and the x-axis is time.
Seeing all of the
destination IPs arranged like this, and obtaining the details on demand by
highlighting them, it became clear from here that the information was being
sent to a single destination IP (100.59.151.133) from a collection of employee
IPs (37.170.100.15, 37.170.100.16, 37.170.100.31, 37.170.100.41, 37.170.100.52,
37.170.100.56). Table
1 shows the
traffic each of the analysts found to be sending out information. In addition,
as the analysts became more experienced and comfortable with the visualization,
they were able to begin quantifying their results within the visualization
rather than referring back to the Excel file.
Table 1. Suspicious outgoing network traffic.
USER WARNING   | SourceIP      | AccessTime              | DestIP         | Socket | ReqSize  | RespSize
Synthetic Data | 37.170.100.15 | 2008-01-31T13:10:23.841 | 100.59.151.133 | 8080   | 9064720  | 11238
Synthetic Data | 37.170.100.16 | 2008-01-10T16:01:53.956 | 100.59.151.133 | 8080   | 8543125  | 12312
Synthetic Data | 37.170.100.16 | 2008-01-15T16:14:34.563 | 100.59.151.133 | 8080   | 6773214  | 24661
Synthetic Data | 37.170.100.31 | 2008-01-10T14:27:12.238 | 100.59.151.133 | 8080   | 6543216  | 22315
Synthetic Data | 37.170.100.41 | 2008-01-17T12:12:10.990 | 100.59.151.133 | 8080   | 3679122  | 24423
Synthetic Data | 37.170.100.41 | 2008-01-29T16:08:10.892 | 100.59.151.133 | 8080   | 6752212  | 57865
Synthetic Data | 37.170.100.52 | 2008-01-31T09:41:03.815 | 100.59.151.133 | 8080   | 5579339  | 22147
Synthetic Data | 37.170.100.56 | 2008-01-29T15:41:32.763 | 100.59.151.133 | 8080   | 10024754 | 29565
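The final narrowing step (many employee IPs, one destination) amounts to grouping the suspicious traffic by destination IP. A minimal Python sketch over a subset of the rows in Table 1 (the record layout is a hypothetical simplification):

```python
from collections import defaultdict

# Hypothetical subset of the suspicious records:
# (source_ip, dest_ip, request_size), values taken from Table 1.
suspicious = [
    ("37.170.100.15", "100.59.151.133", 9064720),
    ("37.170.100.16", "100.59.151.133", 8543125),
    ("37.170.100.41", "100.59.151.133", 3679122),
]

# Group by destination: which hosts receive the traffic, from how many
# distinct employee IPs, and how many bytes in total.
by_dest = defaultdict(lambda: {"sources": set(), "bytes": 0})
for src, dst, size in suspicious:
    by_dest[dst]["sources"].add(src)
    by_dest[dst]["bytes"] += size

# A single destination receiving large uploads from many employee IPs
# is the pattern the analysts recognized in Figure 6.
for dst, info in by_dest.items():
    print(dst, len(info["sources"]), info["bytes"])
```

This is the textual equivalent of switching the scatterplot's y-axis to the destination IP: all the marks collapse onto one destination row.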
The four professional cyber security analysts performed
the task well. Although they were all reluctant to use the visualization at
first, with our help they were each able to find the solution. After the study,
they remarked how working within a visualization provided them with
"interesting findings" much more quickly than working within the raw data. By
using the large, high-resolution display to keep all of the data visible at
all times, and to enlarge the visualization when added detail was needed, they
were able to use the added display space to draw connections between the
different types of data, which ultimately led each of them to the solution. We
feel that, with future visualizations properly designed to take advantage of
the added display space, cyber analytics can benefit from visualization's
inherent advantage: showing connections one would otherwise overlook.